Implementation and Analysis of Join Algorithms to handle skew for the Hadoop Map/Reduce Framework

Author

  • Fariha Atta
Abstract

The Map/Reduce framework, a parallel processing paradigm, is widely used for large-scale distributed data processing. Map/Reduce can perform typical relational database operations such as selection, aggregation, and projection. However, binary relational operators such as join, Cartesian product, and the set operations are difficult to implement with Map/Reduce. Map/Reduce processes homogeneous data streams easily but provides no direct support for multiple heterogeneous input data streams, so the binary relational join operator has no efficient implementation in the framework. Some implementations of the join operator exist for the Hadoop distribution of Map/Reduce, but they do not perform well on heavily skewed data. Skew in the input data degrades the performance of the join operator in a parallel environment, where data is distributed among parallel sites for independent joins. Data skew can severely limit the effectiveness of parallel architectures: some processing units (PUs) are overloaded during data distribution and therefore take longer to complete than the other PUs, whose resources are wasted while they sit idle. Because data skew occurs naturally in many applications, handling it is important for improving the performance of the join operation. We implement a hash join algorithm that is a hybrid of Hadoop's map-side and reduce-side joins and is able to handle skew, and we compare its performance with that of Hadoop's other join algorithms.
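The effect of skew on a repartitioned join, and the general idea of mixing map-side and reduce-side processing, can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not the algorithm implemented in the paper: the names `partition`, `reducer_loads`, and `hybrid_join`, the frequency-based `skew_threshold`, and the toy data are all hypothetical, and no Hadoop API is used.

```python
from collections import Counter, defaultdict


def partition(key, num_reducers):
    # Deterministic stand-in for a hash partitioner (assumes integer keys).
    return key % num_reducers


def reducer_loads(records, num_reducers, skip_keys=frozenset()):
    """Count how many records each simulated reducer receives."""
    loads = [0] * num_reducers
    for key, _ in records:
        if key not in skip_keys:
            loads[partition(key, num_reducers)] += 1
    return loads


def hybrid_join(left, right, num_reducers, skew_threshold):
    """Equi-join two lists of (key, value) pairs on the key.

    Assumption for illustration: a key is "hot" if it occurs more than
    skew_threshold times in `left`. Hot keys are joined map-side against
    an in-memory table of the matching `right` rows; all other keys are
    repartitioned by key and joined reduce-side.
    """
    freq = Counter(key for key, _ in left)
    hot = {k for k, n in freq.items() if n > skew_threshold}

    # Map-side part: keep only the right rows that belong to hot keys.
    hot_right = defaultdict(list)
    for key, val in right:
        if key in hot:
            hot_right[key].append(val)
    output = [(k, lv, rv) for k, lv in left if k in hot for rv in hot_right[k]]

    # Reduce-side part: shuffle the remaining rows to reducers by key.
    buckets = [(defaultdict(list), defaultdict(list)) for _ in range(num_reducers)]
    for key, val in left:
        if key not in hot:
            buckets[partition(key, num_reducers)][0][key].append(val)
    for key, val in right:
        if key not in hot:
            buckets[partition(key, num_reducers)][1][key].append(val)
    for left_rows, right_rows in buckets:
        for key, left_vals in left_rows.items():
            for lv in left_vals:
                for rv in right_rows.get(key, []):
                    output.append((key, lv, rv))
    return output, hot


# Toy data: key 7 dominates the left relation (91 of 100 rows).
left = [(7, f"L{i}") for i in range(90)] + [(k, f"l{k}") for k in range(10)]
right = [(k, f"r{k}") for k in range(10)]

plain_loads = reducer_loads(left, 4)              # one reducer gets nearly everything
joined, hot_keys = hybrid_join(left, right, 4, skew_threshold=10)
balanced_loads = reducer_loads(left, 4, hot_keys)  # load once hot keys bypass the shuffle
```

With four simulated reducers, the plain repartitioning sends 92 of the 100 left rows to a single reducer, while diverting the hot key to the map-side path leaves at most three rows per reducer. In a real cluster the hot-key rows would be spread across the map tasks that already hold them, which is the load-balancing effect such a hybrid aims for.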

Similar resources

A Comparative Analysis of Join Algorithms Using the Hadoop Map/Reduce Framework

The Map/Reduce framework is a programming model recently introduced by Google Inc. to support distributed computing on very large datasets across a large number of machines. It provides a simple yet powerful way to implement distributed applications without requiring deep knowledge of parallel programming. Each participating node executes Map and/or Reduce tasks which involve reading and wri...

Sentiment Analysis of Social Networking Data Using Categorized Dictionary

Sentiment analysis is the process of analyzing a person’s perception or belief about a particular subject matter. However, finding the correct opinion or interest in multi-faceted sentiment data is a tedious task. In this paper, a method is proposed to improve sentiment accuracy by using a categorized dictionary for sentiment classification and analysis. A categorized dictiona...

Adaptive Dynamic Data Placement Algorithm for Hadoop in Heterogeneous Environments

The Hadoop MapReduce framework is an important distributed processing model for large-scale, data-intensive applications. The rack-aware data placement strategy of the current Hadoop distributed file system assumes a homogeneous cluster in which each node has the same computing capacity and is assigned the same workload. Default Hadoop d...

Adaptive Join Plan Generation in Hadoop For CPS296.1 Course Project

Joins in Hadoop have always been a problem for its users: the Map/Reduce framework seems to be designed specifically for group-by aggregation tasks rather than across-table operations. On the other hand, the join operation in distributed database systems was never an easy task, because data location and skewness make join strategies harder to optimize. Fragment-replicate join (map join) may be a cle...

A Scalable and Skew-insensitive Algorithm for Join Operations using Map/Reduce Model

For over a decade, Map/Reduce has been a prominent programming model for handling vast amounts of raw data in large-scale systems. This model ensures scalability, reliability, and availability with reasonable query processing time. However, these large-scale systems still face some challenges: data skew, task imbalance, high disk I/O, and redistribution costs can have disastrous effects on...

Publication date: 2010